Text File | 1996-07-05 | 23.1 KB | 488 lines | [TEXT/R*ch] |
- clustalw_help for version 1.4 (September 1994).
-
- This is the on-line help file for CLUSTAL W.
-
- It should be named or defined as: clustalw_help
- except with MSDOS in which case it should be named CLUSTALW.HLP
-
- For full details of usage and algorithms, please see the files:
- clustalv.doc The documentation for Clustal V (most of the program usage
- and the basic algorithms are the same).
- clustalw.ms A manuscript describing the main algorithmic changes over
- Clustal V.
- readme.txt A brief summary of the main changes over Clustal V.
-
-
- Toby Gibson
- Des Higgins (now at the EBI, Hinxton, Great Britain)
- Julie Thompson
-
- EMBL, Heidelberg, Germany.
-
-
- The paper describing Clustal W is:
-
- Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994)
- CLUSTAL W: improving the sensitivity of progressive multiple
- sequence alignment through sequence weighting, position specific
- gap penalties and weight matrix choice.
- Nucleic Acids Research, submitted, June 1994.
-
-
- >>HELP 1 << General help for CLUSTAL W
-
- Clustal W is a general purpose multiple alignment program for DNA or proteins.
-
- SEQUENCE INPUT: all sequences must be in 1 file, one after another.
- 6 formats are automatically recognised: NBRF/PIR, EMBL/SWISSPROT,
- Pearson (Fasta), Clustal (*.aln), GCG/MSF (Pileup) and GDE.
- All non-alphabetic characters (spaces, digits, punctuation marks) are ignored
- except "-" which is used to indicate a GAP ("." in GCG/MSF).
-
-
- To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to
- INPUT them; go to menu item 2 to do the multiple alignment.
-
- PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments or to add a set
- of new sequences to an old alignment. GAPS in the old alignments are
- indicated using the "-" character. PROFILES can be input in ANY of the
- allowed formats; just use "-" (or "." for MSF) for each gap position.
-
- PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in
- with "-" characters to indicate gaps) OR after a multiple alignment while the
- alignment is still in memory.
-
-
- The program tries to automatically recognise the different file formats used
- and to guess whether the sequences are amino acid or nucleotide. This is not
- always foolproof.
-
- FASTA and NBRF/PIR formats are recognised by having a ">" as the first
- character in the file.
-
- EMBL/Swiss Prot formats are recognised by the letters
- ID at the start of the file (the token for the entry name field).
-
- CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file.
-
- GCG/MSF format is recognised by the word PileUp at the start of the file. If
- your msf files do not contain this word first, edit it in at the start
- of the first line.
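-
- As a rough illustration of the recognition rules above, the following Python
- sketch checks the start of a file. It is not the Clustal W code: the function
- name guess_format is hypothetical, and because the text above gives no rule
- for telling FASTA from NBRF/PIR, or for spotting GDE, those cases are left
- open.
-
```python
def guess_format(text: str) -> str:
    """Guess the input format from the first characters of the file.

    Only the rules stated in the help text are applied; anything else
    (including GDE, for which no rule is given) is reported as unknown.
    """
    if text.startswith(">"):
        # The help text does not say how the two ">" formats are separated.
        return "FASTA or NBRF/PIR"
    if text.startswith("ID"):
        return "EMBL/SwissProt"
    if text.startswith("CLUSTAL"):
        return "CLUSTAL"
    if text.startswith("PileUp"):
        return "GCG/MSF"
    return "unknown"


if __name__ == "__main__":
    print(guess_format("CLUSTAL W multiple sequence alignment"))
```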
-
- If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the
- sequence will be assumed to be nucleotide. This usually works well,
- but watch out!
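-
- The 85% rule can be sketched in Python as follows. This is only an
- illustration under stated assumptions: the function name is hypothetical, and
- non-alphabetic characters (including gaps) are dropped before counting, since
- the general help says they are ignored on input.
-
```python
NUCLEOTIDE_CODES = frozenset("ACGTUN")


def looks_like_nucleotide(sequence: str, threshold_percent: int = 85) -> bool:
    """Return True if at least threshold_percent of the letters are A,C,G,T,U or N."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        return False  # assumption: an empty sequence is not called nucleotide
    hits = sum(1 for c in letters if c in NUCLEOTIDE_CODES)
    # Integer comparison avoids floating point rounding at the 85% boundary.
    return hits * 100 >= threshold_percent * len(letters)


if __name__ == "__main__":
    print(looks_like_nucleotide("ACGT-ACGT-NN"))
    print(looks_like_nucleotide("MKVLAAGIVG"))
```
-
- A short protein made up mostly of A, C, G, T and N would be misclassified,
- which is why the guess should be checked.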
-
-
-
- >>HELP 2 << Help for multiple alignments
-
- If you have already loaded sequences, use menu item 1 to do the complete
- multiple alignment. You will be prompted for 2 output files: 1 for the
- alignment itself; another to store a dendrogram that describes the similarity
- of the sequences to each other.
-
- Multiple alignments are carried out in 3 stages (automatically done from menu
- item 1 ...Do complete multiple alignments now):
-
- 1) all sequences are compared to each other (pairwise alignments);
-
- 2) a dendrogram (like a phylogenetic tree) is constructed, describing the
- approximate groupings of the sequences by similarity (stored in a file).
-
- 3) the final multiple alignment is carried out, using the dendrogram as a guide.
-
-
- PAIRWISE ALIGNMENT parameters control the speed/sensitivity of the initial
- alignments.
-
- MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments.
-
- RESET GAPS (menu item 7) will remove any new gaps introduced into the sequences
- during multiple alignment if you wish to change the parameters and try again.
- This only takes effect just before you do a second multiple alignment. You
- can make phylogenetic trees after alignment whether or not this is ON.
- If you turn this OFF, the new gaps are kept even if you do a second multiple
- alignment. This allows you to iterate the alignment gradually. Sometimes, the
- alignment is improved by a second or third pass.
-
- SCREEN DISPLAY can be used to send the output alignments to the screen
- as well as to the output file.
-
- You can skip the first stages (pairwise alignments; dendrogram) by using an
- old dendrogram file (menu item 3); or you can just produce the dendrogram
- with no final multiple alignment (menu item 2).
-
- OUTPUT FORMAT: Menu item 9 (format options) allows you to choose from 5
- different alignment formats (CLUSTAL, GCG, NBRF/PIR, PHYLIP and GDE).
-
- You can toggle between FAST/APPROXIMATE or SLOW/ACCURATE alignments for
- the initial alignments used to make the guide tree. The fast ones are
- extremely fast but are less reliable than the slow ones.
-
- >>HELP 3 << Help for pairwise alignment parameters
-
- A distance is calculated between every pair of sequences and these are
- used to construct the dendrogram which guides the final multiple alignment.
- The scores are calculated from separate pairwise alignments. These can be
- calculated using 2 methods: dynamic programming (slow but accurate) or by the
- method of Wilbur and Lipman (extremely fast but approximate).
-
- You can choose between the 2 alignment methods using menu option 8. The
- slow/accurate method is fine for short sequences but will be VERY SLOW
- for many (e.g. >20) long (e.g. >1000 residue) sequences.
-
-
- SLOW/ACCURATE alignment parameters:
-
- These parameters do not have any effect on the speed of the alignments. They
- are used to give initial alignments which are then rescored to give percent
- identity scores. These % scores are the ones which are displayed on the
- screen. The scores are converted to distances for the trees.
-
- 1) Gap Open Penalty: the penalty for opening a gap in the alignment.
- 2) Gap extension penalty: the penalty for extending a gap by 1 residue.
- 3) Protein weight matrix: the scoring table which describes the similarity of
- each amino acid to each other. For DNA, an identity matrix is used.
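-
- The rescoring step described above can be sketched in Python. The help text
- does not give the exact formulas, so two assumptions are made here and should
- be treated as such: columns where either sequence has a gap are skipped, and
- the distance is taken as 1 minus the fractional identity. Function names are
- hypothetical.
-
```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity of two aligned sequences of equal length.

    Assumption: columns with a gap ("-") in either sequence are not counted.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_to_distance(pid: float) -> float:
    """Assumed conversion: distance = 1 - (percent identity / 100)."""
    return 1.0 - pid / 100.0


if __name__ == "__main__":
    pid = percent_identity("ACGT-A", "ACCTTA")
    print(pid, identity_to_distance(pid))
```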
-
-
-
- FAST/APPROXIMATE alignment parameters:
-
- These similarity scores are calculated from fast, approximate, global
- alignments, which are controlled by 4 parameters. 2 techniques are used to make
- these alignments very fast: 1) only exactly matching fragments (k-tuples) are
- considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)
- are used.
-
-
- K-TUPLE SIZE: This is the size of exactly matching fragment that is used.
- INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.
- For longer sequences (e.g. >1000 residues) you may need to increase the default.
-
-
- GAP PENALTY: This is a penalty for each gap in the fast alignments. It has
- little effect on the speed or sensitivity except for extreme values.
-
-
-
-
-
-
- TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary
- dot-matrix plot) is calculated. Only the best ones (with most matches) are
- used in the alignment. This parameter specifies how many. Decrease for speed;
- increase for sensitivity.
-
-
- WINDOW SIZE: This is the number of diagonals around each of the 'best'
- diagonals that will be used. Decrease for speed; increase for sensitivity.
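-
- To make the K-TUPLE SIZE, TOP DIAGONALS and WINDOW SIZE parameters concrete,
- here is a small Python sketch of counting k-tuple matches per diagonal. It is
- a simplified illustration, not the Wilbur and Lipman implementation used by
- the program; all function names are hypothetical, and a diagonal is
- identified here by the offset i - j between positions in the two sequences.
-
```python
from collections import Counter, defaultdict


def ktuple_diagonal_counts(seq_a: str, seq_b: str, k: int) -> Counter:
    """Count exactly matching k-tuples on each diagonal (offset i - j)."""
    positions = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        positions[seq_b[j:j + k]].append(j)
    counts = Counter()
    for i in range(len(seq_a) - k + 1):
        for j in positions.get(seq_a[i:i + k], ()):
            counts[i - j] += 1
    return counts


def best_diagonals(counts: Counter, top_diags: int) -> list:
    """Keep only the top_diags diagonals with the most k-tuple matches."""
    return [diag for diag, _ in counts.most_common(top_diags)]


def diagonals_in_window(best: list, window: int) -> list:
    """Add `window` neighbouring diagonals on each side of every best diagonal."""
    return sorted({d + off for d in best for off in range(-window, window + 1)})


if __name__ == "__main__":
    counts = ktuple_diagonal_counts("TTACGTAA", "ACGTCC", k=2)
    best = best_diagonals(counts, top_diags=1)
    print(counts, best, diagonals_in_window(best, window=1))
```
-
- A larger k gives fewer chance matches (faster, less sensitive); more top
- diagonals or a wider window means more of the dot-matrix is examined.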
-
-
- >>HELP 4 << Help for multiple alignment parameters
-
- These parameters control the final multiple alignment. This is the core of
- the program and the details are complicated. To fully understand the use
- of the parameters and the scoring system, you will have to refer to the
- documentation.
-
- Each step in the final multiple alignment consists of aligning two alignments
- or sequences. This is done progressively, following the branching order in
- the GUIDE TREE. The basic parameters to control this are two gap penalties and
- the scores for various identical/non-identical residues.
-
- 1) and 2) The GAP PENALTIES are set by menu items 1 and 2. These control the
- cost of opening up every new gap and the cost of every item in a gap.
- Increasing the gap opening penalty will make gaps less frequent. Increasing
- the gap extension penalty will make gaps shorter. Terminal gaps are not
- penalised.
-
- 3) The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most
- distantly related sequences until after the most closely related sequences have
- been aligned. The setting shows the percent identity level required to delay
- the addition of a sequence; sequences that are less identical than this level
- to any other sequences will be aligned later.
-
-
-
- 4) For DNA, the scoring system assigns a score of 3 for two identical bases
- and zero otherwise. The TOGGLE TRANSITIONS switch (menu item 3) gives
- transitions (A <--> G or C <--> T i.e. purine-purine or pyrimidine-pyrimidine
- substitutions) a score of 1; otherwise, these are scored as mismatches and
- get a score of zero. For distantly related DNA sequences, this switch
- might be better turned off; for closely related sequences it can be useful.
-
- 5) PROTEIN WEIGHT MATRIX leads to a new menu where you are offered a
- choice of weight matrices. The default is the BLOSUM series of
- matrices by Jorja and Steven Henikoff. Note, a series is used! The actual
- matrix that is used depends on how similar the sequences being aligned are at
- each alignment step. Different matrices work differently at different
- evolutionary distances. Further help is offered in the weight matrix menu.
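-
- The DNA scoring described in item 4 above can be written as a small Python
- function. This is a sketch using only the values stated there (3 for a match,
- 1 for a weighted transition, 0 otherwise); the function name is hypothetical,
- and the handling of symbols other than A, C, G and T (an identical pair scores
- as a match) is an assumption.
-
```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def dna_pair_score(x: str, y: str, transitions_weighted: bool = True) -> int:
    """Score one pair of bases: 3 for identity, 1 for a transition, else 0."""
    x, y = x.upper(), y.upper()
    if x == y:
        return 3
    pair = {x, y}
    is_transition = pair <= PURINES or pair <= PYRIMIDINES
    if transitions_weighted and is_transition:
        return 1
    return 0  # transversions, and transitions when the switch is off


if __name__ == "__main__":
    for a, b in [("A", "A"), ("A", "G"), ("C", "T"), ("A", "C")]:
        print(a, b, dna_pair_score(a, b), dna_pair_score(a, b, False))
```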
-
- >>HELP A << Help for protein gap parameters.
-
- 1) RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce
- or increase the gap opening penalties at each position in the alignment or
- sequence. See the documentation for details. As an example, positions that
- are rich in glycine are more likely to have an adjacent gap than positions that
- are rich in valine.
-
- 2) 3) HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within
- a run (5 or more residues) of hydrophilic amino acids; these are likely to
- be loop or random coil regions where gaps are more common. The residues that
- are "considered" to be hydrophilic are set by menu item 3.
-
- 4) GAP SEPARATION DISTANCE tries to decrease the chances of gaps being
- too close to each other. Gaps that are less than this distance apart
- are penalised more than other gaps. This does not prevent close gaps;
- it makes them less frequent, promoting a block-like appearance of the alignment.
-
- 5) END GAP SEPARATION treats end gaps just like internal gaps for the purposes
- of avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above).
- If this is off (default), end gaps will be ignored for this purpose. This is
- useful when you wish to align fragments where the end gaps are not biologically
- meaningful.
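-
- As an illustration of the hydrophilic gap penalty in item 2) 3) above, the
- following Python sketch finds runs of 5 or more hydrophilic residues, the
- regions where gaps are made cheaper. The function name is hypothetical and
- the default residue list "GPSNDQEKR" is an assumption for this example; in
- the program the list is whatever is set by menu item 3.
-
```python
import re


def hydrophilic_runs(sequence: str, residues: str = "GPSNDQEKR", min_run: int = 5):
    """Return (start, end) index pairs of runs of hydrophilic residues.

    `end` is exclusive, so sequence[start:end] is the run itself.
    """
    pattern = re.compile("[" + re.escape(residues) + "]{" + str(min_run) + ",}")
    return [(m.start(), m.end()) for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    seq = "MVLAGSNDKEVVLA"
    for start, end in hydrophilic_runs(seq):
        print(start, end, seq[start:end])
```
-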
-
- >>HELP 5 << Help for output format options.
-
- Five output formats are offered. You can choose more than one (or all 5 if
- you wish).
-
- CLUSTAL format output is a self explanatory alignment format. It shows the
- sequences aligned in blocks. It can be read in again at a later date to
- (for example) calculate a phylogenetic tree or add a new sequence with a
- profile alignment.
-
- GCG output can be used by any of the GCG programs that can work on multiple
- alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG
- .msf format files (multiple sequence file); new in version 7 of GCG.
-
- PHYLIP format output can be used for input to the PHYLIP package of Joe
- Felsenstein. This is an extremely widely used package for doing every
- imaginable form of phylogenetic analysis (MUCH more than the modest
- introduction offered by this program).
-
- NBRF/PIR: this is the same as the standard PIR format with ONE ADDITION. Gap
- characters "-" are used to indicate the positions of gaps in the multiple
- alignment. These files can be re-used as input in any part of clustal that
- allows sequences (or alignments or profiles) to be read in.
-
- GDE: this format is used by the GDE package of Steven Smith.
-
-
- OUTPUT ORDER is used to control the order of the sequences in the output
- alignments. By default, it is the same as the input order. This switch can
- be used to make the order correspond to the order in which the sequences
- were aligned (from the guide tree/dendrogram), thus automatically grouping
- closely related sequences.
-
- >>HELP 6 << Help for profile alignments
-
- By PROFILE ALIGNMENT, we mean alignment to an existing alignment. Either of the
- alignments can be a single sequence. A profile is simply an alignment of
- one or more sequences (e.g. an alignment output file from Clustal W) or a set
- of unaligned sequences.
-
- The profiles can be in any of the allowed input formats with "-" characters
- used to specify gaps (except for GCG/MSF where "." is used).
-
- You have to specify the 2 profiles by choosing menu items 1 and 2 and giving
- 2 file names. Then Menu item 3 will align the 2 profiles to each other.
-
- Menu item 4 will take the sequences in the second profile and align them to
- the first profile, 1 at a time. This is useful to add some new sequences to
- an existing alignment. In this case, the second profile need not be pre-
- aligned.
-
- The alignment parameters can be set using menu items 6 and 7 ("Alignment
- parameters"). These are EXACTLY the same parameters as used by the general,
- automatic multiple alignment procedure. The general multiple alignment
- procedure is simply a series of profile alignments. Carrying out a series of
- profile alignments on larger and larger groups of sequences allows you to
- manually build up a complete alignment.
-
- Profile alignments allow you to store alignments of your favourite sequences
- and add new sequences to them in small bunches at a time.
-
- >>HELP 7 << Help for phylogenetic trees
-
- 1) Before calculating a tree, you must have an ALIGNMENT in memory. This can be
- input in any format, or it can be the result of a full multiple alignment
- that you have just carried out and that is still in memory. Remember YOU MUST
- ALIGN THE SEQUENCES FIRST!!!!
-
- The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First
- you calculate distances (percent divergence) between all pairs of sequences
- from a multiple alignment; second you apply the NJ method to the distance
- matrix.
-
- 2) EXCLUDE POSITIONS WITH GAPS? With this option, any alignment positions
- where ANY of the sequences have a gap will be ignored. This means that 'like'
- will be compared to 'like' in all distances. It also automatically throws
- away the most ambiguous parts of the alignment, which are concentrated around
- gaps (usually). The disadvantage is that you may throw away much of
- the data if there are many gaps.
-
- 3) CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this
- option makes no difference. For greater divergence, this option corrects
- for the fact that observed distances underestimate actual evolutionary
- distances. This is because, as sequences diverge, more than one substitution
- will happen at many sites. However, you only see one difference when you look
- at the present day sequences. Therefore, this option has the effect of
- stretching branch lengths in trees (especially long branches). The corrections
- used here (for DNA or proteins) are both due to Motoo Kimura. See the
- documentation for details. README.TXT describes a new modification for protein
- distances.
-
- For VERY divergent sequences, the distances cannot be reliably
- corrected. You will be warned if this happens. Even if none of the distances
- in a data set exceed the reliable threshold, if you bootstrap the data,
- some of the bootstrap distances may randomly exceed the safe limit.
-
-
- 4) To calculate a tree, use option 4 (DRAW TREE NOW). This gives an UNROOTED
- tree and all branch lengths. The root of the tree can only be inferred by
- using an outgroup (a sequence that you are certain branches at the outside
- of the tree .... certain on biological grounds) OR if you assume a degree
- of constancy in the 'molecular clock', you can place the root in the 'middle'
- of the tree (roughly equidistant from all tips).
-
- 5) BOOTSTRAPPING is a method for deriving confidence values for the groupings in
- a tree (first adapted for trees by Joe Felsenstein). It involves making N
- random samples of sites from the alignment (N should be LARGE, e.g. 500 - 1000);
- drawing N trees (1 from each sample) and counting how many times each grouping
- from the original tree occurs in the sample trees. You must supply a seed
- number for the random number generator. Different runs with the same seed
- will give the same answer. See the documentation for details.
-
- 6) OUTPUT FORMATS: three different formats are allowed. None of these
- displays the tree visually. You must make the tree yourself (on paper)
- using the results OR get the PHYLIP package and use the tree drawing facilities
- there. (Get the PHYLIP package anyway if you are interested in trees).
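-
- The distance and bootstrap steps described in items 1) to 5) above can be
- sketched in Python. This is an illustration only: it computes uncorrected
- divergence (no Kimura correction), does not implement the NJ method itself,
- and all function names are hypothetical. Skipping a column for a pair when
- either sequence has a gap there is an assumption for the case where gap
- positions are not excluded.
-
```python
import random


def gap_free_columns(alignment):
    """Columns where NO sequence has a gap (EXCLUDE POSITIONS WITH GAPS)."""
    length = len(alignment[0])
    return [c for c in range(length) if all(seq[c] != "-" for seq in alignment)]


def divergence(a, b, columns):
    """Fraction of the given columns at which two aligned sequences differ."""
    compared = 0
    differences = 0
    for c in columns:
        if a[c] == "-" or b[c] == "-":
            continue  # assumption: a gapped column is skipped for this pair
        compared += 1
        if a[c] != b[c]:
            differences += 1
    return differences / compared if compared else 0.0


def distance_matrix(alignment, columns):
    """All pairwise divergences; this matrix is the input to the NJ method."""
    n = len(alignment)
    return [[divergence(alignment[i], alignment[j], columns) for j in range(n)]
            for i in range(n)]


def bootstrap_columns(columns, seed):
    """One bootstrap sample: draw columns with replacement, same sample size."""
    rng = random.Random(seed)
    return [rng.choice(columns) for _ in columns]


if __name__ == "__main__":
    alignment = ["ACGT-A", "ACCTTA", "AGGTTA"]
    columns = gap_free_columns(alignment)
    print(distance_matrix(alignment, columns))
    sample = bootstrap_columns(columns, seed=111)
    print(distance_matrix(alignment, sample))
```
-
- Reusing the same seed reproduces the same sample, which is why different
- runs with the same seed give the same answer.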
-
- >>HELP 8 << Help for choosing protein weight matrix
-
- For protein alignments, you use a weight matrix to determine the similarity of
- non-identical amino acids. For example, Tyr aligned with Phe is usually judged
- to be 'better' than Tyr aligned with Pro. These are not used with DNA.
-
- There are two 'in-built' series of weight matrices offered. Each consists
- of several matrices which work differently at different evolutionary distances.
- To see the exact details, read the documentation. Crudely, we store several
- matrices in memory, spanning the full range of amino acid distance (from
- almost identical sequences to highly divergent ones). For very similar
- sequences, it is best to use a strict weight matrix which only gives a high
- score to identities and the most favoured conservative substitutions. For
- more divergent sequences, it is appropriate to use "softer" matrices which
- give a high score to many other frequent substitutions.
-
- 1) BLOSUM (Henikoff). These matrices appear to be the best available for
- carrying out data base similarity (homology) searches. The matrices used are:
- Blosum80, 62, 40 and 30.
-
- 2) PAM (Dayhoff). These have been extremely widely used since the late '70s.
- We use the PAM 120, 160, 250 and 350 matrices.
-
- We also supply an identity matrix which gives a score of 10 to two identical
- amino acids and a score of zero otherwise. This matrix is not very useful.
- Alternatively, you can read in your own (just one matrix, not a series).
-
- A new matrix can be read from a file on disk, if the filename consists only
- of lower case characters. The values in the new weight matrix must be integers
- and the scores should be similarities. You can use negative as well as positive
- values if you wish, although the matrix will be automatically adjusted to all
- positive scores.
-
- INPUT FORMAT: The format used for a new matrix is the same as that of the
- BLAST program.
- Any lines beginning with a # character are assumed to be comments. The first
- non-comment line should contain a list of amino acids in any order, using the
- 1 letter code, followed by a * character. This should be followed by a square
- matrix of integer scores, with one row and one column for each amino acid. The
- last row and column of the matrix (corresponding to the * character) contain
- the minimum score over the whole matrix.
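-
- A reader for the matrix format just described might look like the Python
- sketch below. It is not the program's own parser. The function name is
- hypothetical, and accepting an optional residue letter at the start of each
- row (as BLAST matrix files usually have) is an assumption; without it, rows
- are taken in the same order as the header line.
-
```python
def parse_weight_matrix(text: str) -> dict:
    """Parse a BLAST-style weight matrix into {(row, column): integer score}."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    symbols = lines[0].split()  # 1 letter codes, followed by "*"
    rows = lines[1:]
    if len(rows) != len(symbols):
        raise ValueError("matrix must have one row per header symbol")
    scores = {}
    for index, line in enumerate(rows):
        fields = line.split()
        if len(fields) == len(symbols) + 1:
            row_symbol, fields = fields[0], fields[1:]  # labelled row
        else:
            row_symbol = symbols[index]  # unlabelled row: use header order
        if len(fields) != len(symbols):
            raise ValueError("row %d has the wrong number of scores" % (index + 1))
        for col_symbol, value in zip(symbols, fields):
            scores[(row_symbol, col_symbol)] = int(value)
    return scores


if __name__ == "__main__":
    toy = "# toy two-residue matrix\n  A  R  *\nA  4 -1 -1\nR -1  5 -1\n* -1 -1 -1\n"
    matrix = parse_weight_matrix(toy)
    print(matrix[("A", "A")], matrix[("A", "R")], matrix[("*", "*")])
```
-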
-
- >>HELP 9 << Help for command line parameters
-
- DATA (sequences)
-
- /INFILE=file.ext :input sequences.
- /PROFILE1=file.ext and /PROFILE2=file.ext :profiles (old alignment).
-
- VERBS (do things)
-
- /OPTIONS :list the command line parameters
- /HELP or /CHECK :outline the command line params.
- /ALIGN :do full multiple alignment
- /TREE :calculate NJ tree.
- /BOOTSTRAP(=n) :bootstrap a NJ tree (n= number of bootstraps; def. = 1000).
-
- PARAMETERS (set things)
-
- ***General settings:****
- /INTERACTIVE :read command line, then enter normal interactive menus
- /QUICKTREE :use FAST algorithm for the alignment guide tree
- /NEWTREE= :file for new guide tree
- /USETREE= :file for old guide tree
- /NEGATIVE :protein alignment with negative values in matrix
- /OUTFILE= :sequence alignment file name
- /OUTPUT= :GCG, GDE, PHYLIP or PIR
- /OUTORDER= :INPUT or ALIGNED
- /CASE :LOWER or UPPER (for GDE output only)
-
- ***Fast Pairwise Alignments:***
- /KTUP=n :word size /TOPDIAGS=n :number of best diags.
- /WINDOW=n :window around best diags. /PAIRGAP=n :gap penalty
- /SCORE :PERCENT or ABSOLUTE
-
- ***Slow Pairwise Alignments:***
- /PWMATRIX= :BLOSUM, PAM, ID or filename
- /PWGAPOPEN=f :gap opening penalty /PWGAPEXT=f :gap extension penalty
-
- ***Multiple Alignments:***
- /MATRIX= :BLOSUM, PAM, ID or filename
- /GAPOPEN=f :gap opening penalty /GAPEXT=f :gap extension penalty
- /ENDGAPS :no end gap separation pen. /GAPDIST=n :gap separation pen. range
- /NORGAP :Residue specific gaps off /NOHGAP :hydrophilic gaps off
- /HGAPRESIDUES= :list hydrophilic res. /MAXDIV=n :% ident. for delay
- /TYPE= :PROTEIN or DNA /TRANSITIONS :transitions NOT weighted.
-
- ***Trees:*** /SEED=n :seed number for bootstraps.
- /KIMURA :use Kimura's correction. /TOSSGAPS :ignore positions with gaps.
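-
- When the program is driven from a script, the parameters above have to be
- assembled into a command line. The Python helper below is a sketch of one way
- to do that. The helper name, the program name "clustalw", the file name and
- the parameter values are all assumptions for illustration; only the /NAME and
- /NAME=value forms are taken from the list above.
-
```python
def clustalw_args(program: str = "clustalw", **options) -> list:
    """Build an argument list such as ["clustalw", "/INFILE=x", "/ALIGN"].

    A value of True gives a bare verb or switch (/ALIGN); a value of False
    leaves the switch out; any other value is written as /NAME=value.
    """
    args = [program]
    for name, value in options.items():
        flag = "/" + name.upper()
        if value is False:
            continue
        args.append(flag if value is True else "%s=%s" % (flag, value))
    return args


if __name__ == "__main__":
    # Hypothetical run: full alignment of seqs.pir with PHYLIP output.
    print(" ".join(clustalw_args(infile="seqs.pir", align=True,
                                 output="PHYLIP", outorder="ALIGNED")))
```
-
- The resulting list could be handed to subprocess.run; that step is left out
- here because the name and location of the executable depend on the
- installation.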
-
- >>HELP 0 << Help for tree output format options
-
- Three output formats are offered: 1) Clustal, 2) Phylip/TreeTool,
- 3) Just the distances.
-
- None of these formats displays the results graphically. To see a graphic
- representation of a tree (not a bootstrapped tree), get the PHYLIP package and
- use format 2) below. It can be imported into the PHYLIP programs RETREE,
- DRAWTREE and DRAWGRAM and displayed graphically. TreeTool can also do this
- but is only available for SUN (by ftp from rdp.life.uiuc.edu). TreeTool,
- however, has a neat facility for labels on internal nodes which we use to
- display bootstrap figures on the bootstrap trees. If you do not have TreeTool,
- please request the trees in Clustal format 1) below.
-
-
- 1) Clustal format output.
- This format is verbose and lists all of the distances between the sequences
- and the number of alignment positions used for each. The tree is described
- at the end of the file. It lists the sequences that are joined at each
- alignment step and the branch lengths. After two sequences are joined, the
- pair is referred to later as a NODE. The number of a NODE is the number of the
- lowest sequence in that NODE.
-
- 2) Phylip or TreeTool format output.
- This format is the New Hampshire format, used by many phylogenetic analysis
- packages. It consists of a series of nested parentheses, describing the
- branching order, with the sequence names and branch lengths. With a simple
- tree, it can be used by the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP
- package to see the trees graphically. This is the same format used during
- multiple alignment for the guide trees.
-
- With a bootstrap tree, you need to use TreeTool or request format 1) above.
-
- 3) The distances only.
- This format just outputs a matrix of all the pairwise distances in a format
- that can be used by the Phylip package. It used to be useful when one
- could not produce distances from protein sequences in the Phylip package but
- is now redundant (Protdist of Phylip 3.5 now does this).
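-
- To show what the nested-parentheses (New Hampshire) format of 2) above looks
- like, here is a small Python sketch that writes a tree in that style. The
- tree representation (nested tuples), the function names, the sequence names
- and the branch lengths are all made up for this example; the exact number
- formatting used by the program may differ.
-
```python
def _subtree(node) -> str:
    """node is (name, length) for a leaf or (list_of_children, length)."""
    label, length = node
    if isinstance(label, str):
        text = label
    else:
        text = "(" + ",".join(_subtree(child) for child in label) + ")"
    if length is not None:
        text += ":%.3f" % length
    return text


def to_new_hampshire(root) -> str:
    """Write the whole tree, terminated by a semicolon."""
    return _subtree(root) + ";"


if __name__ == "__main__":
    # Unrooted tree of four hypothetical sequences with a three-way top node.
    tree = ([("seqA", 0.1), ("seqB", 0.2),
             ([("seqC", 0.3), ("seqD", 0.4)], 0.05)], None)
    print(to_new_hampshire(tree))
```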
-